🔬 Simulation-Based Methods versus Theory-Based Methods


Lab 9 – One-Way ANOVA

Author

Your group’s names here!

Published

June 2, 2023

library(tidyverse)
library(infer)
library(ggridges)
library(broom)

Today’s Data

These data come from the Gapminder Foundation, an organization interested in increasing the use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.

Today we will be comparing math achievement scores across continents and years. Math achievement was measured for 42 countries based on their average score for the grade 8 international TIMSS test.

math_scores <- read_csv(here::here("labs", 
                                   "data",
                                   "math_scores.csv")
                        )

# Creating a year_cat variable that is the categorical version of year
math_scores <- mutate(math_scores, 
                      year_cat = as.factor(year)
                      )

# Removing the missing values from the grade_8_math_score variable
math_scores <- drop_na(data = math_scores, 
                       grade_8_math_score)

Data Visualizations

The first step for a statistical analysis should always be creating visualizations of the data. Similar to what you are expected to do for your project, you will make three density ridge plots:

  • visualizing the relationship between math score and year
  • visualizing the relationship between math score and continent
  • visualizing the relationship between math score and both year and continent

Plan for Week 9

  • Asynchronous class on Tuesday and Thursday

  • Typical deadlines for reading (Tuesday) and tutorial (Thursday)

  • “Checkpoints” for Final Project incorporated throughout the week

    • Introduction – Due Wednesday
    • Methods – Due Friday
    • Findings & Scope of Inference – Due Sunday

Some advice on your Final Project…

What did we do on Tuesday?

We carried out a hypothesis test!

\[H_0: \beta_1 = 0\]

\[H_A: \beta_1 \neq 0\]

What do these hypotheses mean in words?

By creating a permutation distribution!

null_dist <- evals %>% 
  specify(response = score, 
          explanatory = bty_avg) %>% 
  hypothesise(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "slope")


What is happening in the generate() step?

And visualizing where our observed statistic fell on the distribution

What would you estimate the p-value to be?

And calculated the p-value

get_p_value(null_dist, 
            obs_stat = obs_slope, 
            direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
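
As a rough illustration of the permutation idea, here is a minimal base R sketch. It uses simulated toy data (not the evals dataset), and the two-sided p-value is computed in one common way, as the share of permuted slopes at least as far from zero as the observed slope; this is not necessarily the exact rule get_p_value() applies.

```r
# Hypothetical toy data standing in for evals (NOT the real dataset)
set.seed(2023)
n <- 50
x <- runif(n, min = 1, max = 10)
y <- 4 + 0.05 * x + rnorm(n, sd = 0.5)

# Slope of the least-squares line of y on x
slope <- function(x, y) unname(coef(lm(y ~ x))[2])

obs <- slope(x, y)

# One permutation: shuffle the response so any link to x is broken,
# then recompute the slope. Repeating this builds the null distribution.
reps <- 1000
null_slopes <- replicate(reps, slope(x, sample(y)))

# Two-sided p-value (one common definition): proportion of permuted
# slopes at least as extreme as the observed slope
p_value <- mean(abs(null_slopes) >= abs(obs))
p_value
```

With 1000 permutations the smallest nonzero proportion is 1/1000, so a reported p-value of 0 is better read as "less than 0.001" than as exactly zero.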



What would you decide for your hypothesis test?

How would this process have changed if we used theory-based methods instead?

Approximating the permutation distribution

A \(t\)-distribution can be a reasonable approximation for the permutation distribution if certain conditions are not violated.

What about the observed statistic?

obs_slope <- evals %>% 
  specify(response = score, 
          explanatory = bty_avg) %>% 
  calculate(stat = "slope")
Response: score (numeric)
Explanatory: bty_avg (numeric)
# A tibble: 1 × 1
    stat
   <dbl>
1 0.0666
evals_lm <- lm(score ~ bty_avg,
               data = evals)

get_regression_table(evals_lm)
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    3.88      0.076     51.0        0    3.73     4.03 
2 bty_avg      0.067     0.016      4.09       0    0.035    0.099

How did R calculate the \(t\)-statistic?

\(SE_{b_1} = \frac{\frac{s_y}{s_x} \cdot \sqrt{1 - r^2}}{\sqrt{n - 2}}\)

[1] 0.01495204

\(t = \frac{b_1}{SE_{b_1}}\)

bty_avg 
4.45672 
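
The two formulas can be checked against lm() with a short sketch. The data here are simulated and purely hypothetical (not evals); the point is only that the hand calculation and the regression output should agree.

```r
# Hypothetical toy data (NOT the evals dataset)
set.seed(1)
n <- 40
x <- rnorm(n, mean = 5, sd = 1.5)
y <- 3.9 + 0.07 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)
b1 <- unname(coef(fit)["x"])

# Standard error of the slope, following the SE formula above
r <- cor(x, y)
se_b1 <- (sd(y) / sd(x)) * sqrt(1 - r^2) / sqrt(n - 2)

# t-statistic: the slope divided by its standard error
t_stat <- b1 / se_b1

# The same two quantities as reported by lm()
coefs <- summary(fit)$coefficients
all.equal(se_b1, unname(coefs["x", "Std. Error"]))
all.equal(t_stat, unname(coefs["x", "t value"]))
```

Both comparisons should return TRUE, because the formula is an algebraic rearrangement of the standard error that lm() computes.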
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    3.88      0.076     51.0        0    3.73     4.03 
2 bty_avg      0.067     0.016      4.09       0    0.035    0.099

How does R calculate the p-value?

How many degrees of freedom does this \(t\)-distribution have?
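
One way to reason about the p-value is to find the area in both tails of a \(t\)-distribution beyond the observed statistic. The sketch below assumes \(n - 2\) degrees of freedom (the usual choice for simple linear regression) and uses the 463 observations reported for evals in these slides, together with the statistic of 4.09 from the regression table.

```r
# Values taken from the regression table and the evals description
t_stat <- 4.09   # statistic for bty_avg
n <- 463         # observations in evals
df <- n - 2      # assumption: an intercept and one slope are estimated

# Two-sided p-value: area in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value

# Rounded to three decimal places
round(p_value, 3)
```

The unrounded value is tiny but not zero. Rounded to three decimal places it becomes 0, which may be why the table displays a p-value of 0.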

Did we get similar results between these methods?

Why not always use theoretical methods?

Theory-based methods only hold if the sampling distribution is normally shaped.

The normality of a sampling distribution depends heavily on model conditions.

What are these “conditions”?

For linear regression we are assuming…

Linear relationship between \(x\) and \(y\)


Independent observations


Normality of residuals


Equal variance of residuals

Linear relationship between \(x\) and \(y\)

What should we do?

Variable transformation!

Independence of observations

The evals dataset contains 463 observations on 94 professors, meaning that some professors have multiple observations.


What can we do?

Best – use a random effects model

Reasonable – collapse the multiple scores into a single score
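
A minimal sketch of the "collapse" option, using a small made-up data frame. The column name prof_id and all values are hypothetical and chosen only for illustration; the real evals dataset may label its professor identifier differently.

```r
# Hypothetical toy data: three professors, several evaluations each
toy_evals <- data.frame(
  prof_id = c(1, 1, 1, 2, 2, 3),
  score   = c(4.2, 4.6, 4.4, 3.8, 4.0, 4.9),
  bty_avg = c(5.0, 5.0, 5.0, 3.5, 3.5, 7.2)
)

# One row per professor: average the values within each professor
prof_means <- aggregate(cbind(score, bty_avg) ~ prof_id,
                        data = toy_evals,
                        FUN = mean)
prof_means
```

After collapsing, each professor contributes a single row, so a regression fit to prof_means no longer treats repeated evaluations of the same professor as separate observations.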

Normality of residuals

What should we do?

Variable transformation!

Equal variance of residuals

What should we do?

Variable transformation!

Are these conditions required for both methods?

Simulation-based Methods


Question 1 – Fill in the code below to visualize the distribution of grade 8 math scores over time.

Don’t forget to include axis labels!

ggplot(data = math_scores, 
       mapping = aes(x = ____, 
                     y = ____)) +
  geom_density_ridges(scale = 1) 

Note: I’ve included a scale = 1 argument to show you how you can get the density plots not to overlap!

Question 2 – What do you see in the plot you made? How do the centers (means) of the distributions compare? What about the variability (spread) of the distributions?

Question 3 – Write the code to visualize the distribution of grade 8 math scores for the six different continents.

Don’t forget to include axis labels!

Question 4 – What do you see in the plot you made? How do the centers (means) of the distributions compare? What about the variability (spread) of the distributions?

Question 5 – Write the code to visualize the distribution of grade 8 math scores for the six different continents for each of the four years.

Remember, you could either include a facet or a color here! Also remember you can use alpha to change the transparency of your density ridges!

Question 6 – What do you see in the plot you made? Does it seem that the relationship between year and grade 8 math scores changes based on the continent of the student?

Statistical Model

For our analysis we will be using an analysis of variance (ANOVA) model. An ANOVA is an appropriate statistical model as we have a continuous response variable (grade 8 math score) and categorical explanatory variables (year, continent). Year is not considered to be a continuous numerical variable as we have only four measurements in time (1996, 1999, 2003, 2007).

Model Conditions

An ANOVA has model conditions that are very similar to what we learned for linear regression. In this section we will evaluate the conditions of the model.

For this section, it might be helpful to know how many observations there are for each year and for each continent. I have written code below to provide you with a table of these numbers:

count(math_scores, continent, year) %>% 
  pivot_wider(names_from = continent, 
              values_from = n, 
              values_fill = 0) %>% 
  janitor::adorn_totals(where = c("row", "col"))

Independence

Based on the table we know:

  • each year has measurements on about six continents
  • each continent has measurements for about four years

Theory-based Methods

  • Linearity of Relationship
  • Independence of Observations
  • Normality of Residuals
  • Equal Variance of Residuals

What happens if the conditions are violated?

In general, when the conditions associated with these methods are violated, the permutation and \(t\)-distributions will underestimate the true standard error of the sampling distribution.


Use this information to evaluate the condition of independence of observations.

Question 7 – Is it reasonable to assume that the observations within a continent are independent of each other?

Question 8 – Is it reasonable to assume that the observations within a year are independent of each other?

Question 9 – Is it reasonable to assume that the observations between continents are independent of each other?

Question 10 – Is it reasonable to assume that the observations between years are independent of each other?

Normality

Now we will evaluate the normality of the distributions of grade 8 math scores across years and across continents – the plot you created in Question 5. Keep in mind, the normality condition is very important when the sample sizes for each group are relatively small.

Question 11 – Is it reasonable to say that the grade 8 math scores across the four years and six continents are normally distributed?

Equal Variance

Now we will evaluate the variability of the distributions of grade 8 math scores across years and across continents – the plot you created in Question 5. Keep in mind, the constant variance condition is especially important when the sample sizes differ between groups.

For this section, it might be helpful to know the variance for each year / continent combo. I have written code below to provide you with a table of these numbers:

Keep in mind a variance of NA can happen for two reasons: (1) there is no data, or (2) there is only one observation.

math_scores %>% 
  group_by(year, continent) %>% 
  summarize(var = var(grade_8_math_score, na.rm = TRUE)
            ) %>% 
  pivot_wider(names_from = continent, values_from = var)

Looking at the table, we can see that the largest variance of 10257 (North America, 2007) is nearly 27 times larger than the smallest variance of 381 (Europe, 2003). That’s a lot! So, our equal variance condition is definitely violated.
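
The comparison can be reproduced directly from the two variances quoted above. The sketch also converts the ratio to the standard deviation scale, which some readers find easier to interpret.

```r
# Variances quoted in the text
largest_var  <- 10257  # North America, 2007
smallest_var <- 381    # Europe, 2003

# How many times larger is the largest variance?
var_ratio <- largest_var / smallest_var
round(var_ratio, 1)   # 26.9

# The same comparison on the standard deviation scale
sd_ratio <- sqrt(var_ratio)
round(sd_ratio, 1)    # 5.2
```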

But, we have learned tools to attempt to remedy this issue! Let’s take the log of grade_8_math_score and see how the variances compare.

math_scores %>% 
  group_by(year, continent) %>% 
  summarize(log_var = var(log(grade_8_math_score))
            ) %>%
  pivot_wider(names_from = continent, values_from = log_var)

Question 12 – Based on the variances in the table above, is it reasonable to say that the log grade 8 math scores across the four years and six continents have equal variability?

One-Way ANOVA Inference

We are going to test out both methods for conducting a hypothesis test for an ANOVA – theory-based and simulation-based methods. Keep in mind both methods require independence of observations and equal variability. Normality, however, is only a condition of theory-based methods.

Testing for a Difference Between Years

Since the distribution of grade 8 math scores across the four years wasn’t too far from Normal, let’s give a theory-based method a try.

Question 13 – Fill in the code below to conduct a one-way ANOVA modeling the relationship between mean grade 8 math score and year.

Keep in mind the response variable comes first and the explanatory variable comes second!

aov(____ ~ ____, data = math_scores) %>% 
  broom::tidy()

Question 14 – At an \(\alpha = 0.1\), what decision would you reach for your hypothesis test?

Question 15 – What would you conclude about the relationship between the mean grade 8 math scores and year?

Testing for a Difference Between Continents

Since the distribution of grade 8 math scores across the six continents didn’t look very Normal, let’s give a simulation-based method a try.

I’ve gotten you started by calculating the observed F-statistic for the relationship between a country’s grade 8 math score and its continent.

obs_F <- math_scores %>% 
  specify(response = grade_8_math_score, 
          explanatory = continent) %>% 
  calculate(stat = "F")
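
For intuition about what an F-statistic measures, here is a sketch that computes the standard one-way ANOVA F by hand and compares it with aov(). The data frame toy and its values are made up for illustration and are not the math_scores data; the sketch assumes the F requested from calculate() is this same ratio of mean squares.

```r
# Hypothetical toy data: a numeric response measured in three groups
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(4, 5, 3))),
  value = c(10, 12, 11, 13, 15, 14, 16, 17, 15, 9, 8, 10)
)

n <- nrow(toy)
k <- nlevels(toy$group)
grand_mean  <- mean(toy$value)
group_means <- tapply(toy$value, toy$group, mean)
group_sizes <- tapply(toy$value, toy$group, length)

# Between-group variability: how far group means sit from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group variability: how far observations sit from their group mean
# (ave() returns each observation's group mean)
ss_within <- sum((toy$value - ave(toy$value, toy$group))^2)

# F is the ratio of the two mean squares
F_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))

# Compare with the F-statistic reported by aov()
F_aov <- summary(aov(value ~ group, data = toy))[[1]][["F value"]][1]
all.equal(F_by_hand, F_aov)
```

A large F means the group means differ by more than the within-group spread would suggest, which is why only the right tail of the null distribution counts as evidence against the null hypothesis.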

Question 16 – Write the code to generate a permutation distribution of resampled F-statistics.

Question 17 – Visualize the null distribution and shade how the p-value should be calculated.

Keep in mind you only look at the right tail for an ANOVA!

Question 18 – Calculate the p-value for the observed F-statistic

Question 19 – At an \(\alpha = 0.1\), what decision would you reach for your hypothesis test?

Question 20 – What would you conclude about the relationship between the mean grade 8 math scores and continent?
